You will receive marks for correctly submitting this assignment. To submit it correctly, follow the instructions below:
Here you will find the description of each rubric used in MDS.
NOTE: The data you download for use in this lab SHOULD NOT BE PUSHED TO YOUR REPOSITORY. You might be penalised for pushing datasets to your repository. I have seeded the repository with a .gitignore, which should prevent you from pushing CSVs.
# Your imports
import os
%matplotlib inline
import string
import sys
from collections import deque
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
# preprocessing
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.impute import SimpleImputer
# classifiers / models
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
# model selection
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_graphviz
In this lab you will be working on an open-ended mini-project, where you will put all the different things you have learned so far in 571 and 573 together to solve an interesting problem.
A few notes and tips when you work on this mini-project:
We plan to grade fairly and leniently. We don't have some secret target score that you need to achieve to get a good grade. You'll be assessed on demonstration of mastery of course topics, clear presentation, and the quality of your analysis and results. For example, if you just have a bunch of code and no text or figures, that's not good. If you do a bunch of sane things and get a lower accuracy than your friend, don't sweat it.
Finally, the style of this "project" question is different from other assignments. It'll be up to you to decide when you're "done" -- in fact, this is one of the hardest parts of real projects. But please don't spend WAY too much time on this... perhaps "a few hours" (2-8 hours???) is a good guideline for a typical submission. Of course, if you're having fun, you're welcome to spend as much time as you want! But if so, try not to do it out of perfectionism or a desire to get the best possible grade. Do it because you're learning and enjoying it. Students from past cohorts have found this kind of lab useful and fun, and I hope you enjoy it as well.
In this mini project, you will be working on a classification problem: predicting whether a credit card client will default or not. For this problem, you will use the Default of Credit Card Clients dataset. This dataset has 30,000 examples and 23 features, and the goal is to estimate whether a person will default on (fail to pay) their credit card bills; this column is labeled "default.payment.next.month" in the data. The rest of the columns can be used as features. You may take some ideas from, and compare your results with, the associated research paper, which is available through the UBC library.
Your tasks:
This is a classification problem. The target column is default.payment.next.month, where "1" means the client defaulted; since that is the class we are more interested in, 'recall' or 'f1' is an appropriate scoring metric. There are 23 features, all of numeric data type. Some of the columns are binary (such as SEX), and some are categorical features that are already encoded with an order (such as EDUCATION), so they can be passed through without further transformation. With 30,000 examples, the dataset is large enough that optimization bias on the validation set is not a major concern.
The dataset is from Taiwan so any model developed based on the data may not be appropriate for use in other countries. In addition, the data was collected back in 2005 so it could be outdated. The model should be used with caution.
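The pass-through idea above can be sketched with a `ColumnTransformer`: scale the genuinely numeric columns and leave the already-encoded binary/ordinal ones untouched. This is a minimal sketch on a tiny stand-in frame; the exact column split here is illustrative, not the final feature lists:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Tiny stand-in for train_df (columns chosen for illustration only)
toy = pd.DataFrame({
    "LIMIT_BAL": [20000.0, 120000.0, 90000.0],
    "AGE": [24, 26, 34],
    "SEX": [2, 2, 1],        # binary, already encoded
    "EDUCATION": [2, 1, 3],  # ordinal, already encoded
})

numeric_feats = ["LIMIT_BAL", "AGE"]
passthrough_feats = ["SEX", "EDUCATION"]

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_feats),
    ("passthrough", passthrough_feats),
)

# Scaled numeric columns come first, followed by the untouched encoded columns
X_t = preprocessor.fit_transform(toy)
print(X_t.shape)
```

The same preprocessor can then be dropped into a `Pipeline` in front of any of the classifiers imported above.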
credit_df = pd.read_csv("UCI_Credit_Card.csv", index_col="ID")
credit_df
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | -2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 4 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 5 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29996 | 220000.0 | 1 | 3 | 1 | 39 | 0 | 0 | 0 | 0 | 0 | ... | 88004.0 | 31237.0 | 15980.0 | 8500.0 | 20000.0 | 5003.0 | 3047.0 | 5000.0 | 1000.0 | 0 |
| 29997 | 150000.0 | 1 | 3 | 2 | 43 | -1 | -1 | -1 | -1 | 0 | ... | 8979.0 | 5190.0 | 0.0 | 1837.0 | 3526.0 | 8998.0 | 129.0 | 0.0 | 0.0 | 0 |
| 29998 | 30000.0 | 1 | 2 | 2 | 37 | 4 | 3 | 2 | -1 | 0 | ... | 20878.0 | 20582.0 | 19357.0 | 0.0 | 0.0 | 22000.0 | 4200.0 | 2000.0 | 3100.0 | 1 |
| 29999 | 80000.0 | 1 | 3 | 1 | 41 | 1 | -1 | 0 | 0 | 0 | ... | 52774.0 | 11855.0 | 48944.0 | 85900.0 | 3409.0 | 1178.0 | 1926.0 | 52964.0 | 1804.0 | 1 |
| 30000 | 50000.0 | 1 | 2 | 1 | 46 | 0 | 0 | 0 | 0 | 0 | ... | 36535.0 | 32428.0 | 15313.0 | 2078.0 | 1800.0 | 1430.0 | 1000.0 | 1000.0 | 1000.0 | 1 |
30000 rows × 24 columns
train_df, test_df = train_test_split(credit_df, test_size=0.20, random_state=123)
train_df.head()
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19683 | 200000.0 | 2 | 2 | 1 | 46 | 0 | 0 | 0 | 0 | 0 | ... | 103422.0 | 95206.0 | 65108.0 | 3692.0 | 5000.0 | 3300.0 | 2500.0 | 2930.0 | 1500.0 | 0 |
| 11063 | 120000.0 | 2 | 1 | 1 | 32 | -1 | -1 | -1 | -1 | -1 | ... | 476.0 | 802.0 | 326.0 | 652.0 | 326.0 | 476.0 | 802.0 | 0.0 | 326.0 | 1 |
| 198 | 20000.0 | 2 | 1 | 2 | 22 | 0 | 0 | 0 | 0 | -1 | ... | 8332.0 | 18868.0 | 19247.0 | 1500.0 | 1032.0 | 541.0 | 20000.0 | 693.0 | 1000.0 | 0 |
| 23621 | 100000.0 | 2 | 5 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 23181.0 | 7721.0 | 3219.0 | 5004.0 | 3811.0 | 3002.0 | 4000.0 | 3219.0 | 1864.0 | 0 |
| 26032 | 290000.0 | 2 | 2 | 2 | 29 | 0 | 0 | 0 | 0 | 0 | ... | 8770.0 | 9145.0 | 10016.0 | 1130.0 | 1502.0 | 1300.0 | 500.0 | 1000.0 | 1001.0 | 0 |
5 rows × 24 columns
train_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24000 entries, 19683 to 19967
Data columns (total 24 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   LIMIT_BAL                   24000 non-null  float64
 1   SEX                         24000 non-null  int64
 2   EDUCATION                   24000 non-null  int64
 3   MARRIAGE                    24000 non-null  int64
 4   AGE                         24000 non-null  int64
 5   PAY_0                       24000 non-null  int64
 6   PAY_2                       24000 non-null  int64
 7   PAY_3                       24000 non-null  int64
 8   PAY_4                       24000 non-null  int64
 9   PAY_5                       24000 non-null  int64
 10  PAY_6                       24000 non-null  int64
 11  BILL_AMT1                   24000 non-null  float64
 12  BILL_AMT2                   24000 non-null  float64
 13  BILL_AMT3                   24000 non-null  float64
 14  BILL_AMT4                   24000 non-null  float64
 15  BILL_AMT5                   24000 non-null  float64
 16  BILL_AMT6                   24000 non-null  float64
 17  PAY_AMT1                    24000 non-null  float64
 18  PAY_AMT2                    24000 non-null  float64
 19  PAY_AMT3                    24000 non-null  float64
 20  PAY_AMT4                    24000 non-null  float64
 21  PAY_AMT5                    24000 non-null  float64
 22  PAY_AMT6                    24000 non-null  float64
 23  default.payment.next.month  24000 non-null  int64
dtypes: float64(13), int64(11)
memory usage: 4.6 MB
From the output above, we can see that there are no missing values in the dataset.
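The same conclusion can be checked directly with `isna()` rather than read off `info()`. A minimal sketch on a toy frame with a deliberately missing value (for the real data, the equivalent check is `train_df.isna().sum()`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; column "a" has one missing value
toy = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [1, 2, 3]})

# Per-column count of missing values; all zeros means no imputation is needed
print(toy.isna().sum())
```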
train_df['default.payment.next.month'].value_counts(normalize=True)
0    0.777833
1    0.222167
Name: default.payment.next.month, dtype: float64
We have class imbalance: class "1" makes up only about 22.2% of train_df. Since class "1" is the class we are interested in identifying, I would go with the f1 score.
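Given that imbalance, a `DummyClassifier` baseline scored with f1 is a sensible first comparison point, since always predicting the majority class yields an f1 of 0 for the positive class. A minimal sketch on synthetic data mimicking the ~22% positive rate; the features and the `class_weight="balanced"` choice are illustrative, not part of the lab:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features/target (~22% positives, like train_df)
rng = np.random.default_rng(123)
X = rng.normal(size=(200, 5))
y = (rng.random(200) < 0.22).astype(int)

# Baseline: predicting the majority class ("no default") gives f1 = 0
dummy_scores = cross_validate(
    DummyClassifier(strategy="most_frequent"), X, y, scoring="f1", cv=5
)

# A simple model; class_weight="balanced" is one way to handle the imbalance
lr = make_pipeline(StandardScaler(), LogisticRegression(class_weight="balanced"))
lr_scores = cross_validate(lr, X, y, scoring="f1", cv=5)

print(dummy_scores["test_score"].mean(), lr_scores["test_score"].mean())
```

Any real model should clearly beat the dummy's f1 before its score is worth discussing.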
import altair as alt
alt.data_transformers.disable_max_rows()
alt.Chart(train_df, title="Age vs. Credit Limit").mark_point().encode(
    alt.X("AGE", title="Age"),
    alt.Y("LIMIT_BAL", title="Credit limit"),
)